Abstract: We give a new algorithm for performing the distinct-degree factorization of 
a polynomial P(x) over GF(2), using a multi-level blocking strategy. The coarsest level of 
blocking replaces GCD computations by multiplications, as suggested by Pollard (1975), 
von zur Gathen and Shoup (1992), and others. The novelty of our approach is that a finer 
level of blocking replaces multiplications by squarings, which speeds up the computation in 
GF(2)[x]/P(x) of certain interval polynomials when P(x) is sparse. 

As an application we give a fast algorithm to search for all irreducible trinomials $x^r + x^s + 1$ 
of degree r over GF(2), while producing a certificate that can be checked in less time than 
the full search. Naive algorithms cost $O(r^2)$ per trinomial, thus $O(r^3)$ to search over all 
trinomials of given degree r. Under a plausible assumption about the distribution of factors 
of trinomials, the new algorithm has complexity $O(r^2 (\log r)^{3/2} (\log\log r)^{1/2})$ for the search 
over all trinomials of degree r. Our implementation achieves a speedup of greater than a 
factor of 560 over the naive algorithm in the case r = 24036583 (a Mersenne exponent). 

Using our program, we have found two new primitive trinomials of degree 24036583 over 
GF(2) (the previous record degree was 6972593). 

Key-words: Amortized complexity, distinct degree factorization, finite field, irreducible 
trinomial, Mersenne exponent, polynomial factorization, primitive trinomial 



A MULTI-LEVEL BLOCKING DISTINCT DEGREE 
FACTORIZATION ALGORITHM 



1. Introduction 

The problem of factoring a univariate polynomial P(x) over a finite field F often arises 
in computational algebra [11, 12]. An important case is when F has small characteristic 
and P(x) has high degree but is sparse, that is, P(x) has only a small number of nonzero 
terms. 

To simplify the exposition we restrict attention to the case where F = GF(2) and P(x) 
is a trinomial 

$$P(x) = x^r + x^s + 1, \quad r > s > 0,$$

although the ideas apply more generally and should be useful for factoring sparse polynomials 
over fields of small characteristic. 

Our aim is to give an algorithm with good amortized complexity, that is, one that works 
well on average. Since we are restricting attention to trinomials, we average over all 
trinomials of fixed degree r. 

Our motivation is to speed up previous algorithms for searching for irreducible trinomials 
of high degree [H O H3]. For given degree r, we want to find all irreducible trinomials 
$x^r + x^s + 1$. 

In our examples the degree r is a Mersenne exponent, i.e., $2^r - 1$ is a Mersenne prime. 
In this case an irreducible trinomial of degree r is necessarily primitive. In general, without 
the restriction to Mersenne exponents, we would need the prime factorisation of $2^r - 1$ in 
order to test primitivity (see e.g., [TP]). 

We are only interested in Mersenne exponents $r \equiv \pm 1 \pmod 8$, because in other cases 
Swan's theorem O [20] HI] rules out irreducible trinomials of degree r (except for s = 2 or 
r - 2, but these cases are usually easy to handle: for example if r = 13466917 or 20996011 
we have $r \equiv 1 \pmod 3$, so $x^r + x^2 + 1$ is divisible by $x^2 + x + 1$). 

Mersenne exponents can be found on the GIMPS website [22]. At the time of writing, 
the five largest known Mersenne exponents r satisfying the condition $r \equiv \pm 1 \pmod 8$ are 
r = 6972593, 24036583, 25964951, 30402457 and 32582657. In the smallest case r = 6972593, 
a primitive trinomial was found by Brent, Larvala and Zimmermann [S] using an efficient 
implementation of the naive algorithm. However, it was not feasible to consider the larger 
Mersenne exponents r using the same algorithm, since the time complexity of this algorithm 
is roughly of order $r^3$, and the next case r = 24036583 would take about 41 times longer 
than r = 6972593. With the new "fast" algorithm described in this paper we have been able 
to find two primitive trinomials of degree r = 24036583 in less time than the naive algorithm 
took for r = 6972593. The speedup over the naive algorithm for r = 24036583 is about a 
factor of 560. 

If $x^r + x^s + 1$ is reducible then we want to provide an easily-checked certificate of 
reducibility. The certificate can simply be an encoding of an irreducible factor f of $x^r + x^s + 1$. We 
choose the factor f of smallest degree d > 0. In case there are several factors of equal 
smallest degree d, we give the one that is least in lexicographic order, e.g., $x^3 + x + 1$ is 
preferred to $x^3 + x^2 + 1$. 

1.1. Distinct degree factorization. Our basic algorithm performs distinct degree 
factorization [Hllinifll]. That is, if P(x) has several factors of the same degree d, the algorithm 
will produce the product of these factors. The Cantor-Zassenhaus algorithm is used to split 
this product into distinct factors. This is cheap because the product usually consists of just 
one irreducible factor or is a product of irreducible factors of small (equal) degree. 

In the complexity analysis we only consider the time required to find one nontrivial factor 
(it will be a factor of smallest degree) or output "irreducible", since that is what is required 
in the search for irreducible trinomials. 

1.2. Factorization over GF(2). It is well-known that $x^{2^d} + x$ is the product of all 
irreducible polynomials of degree dividing d. For example, 

$$x^{2^3} + x = x(x + 1)(x^3 + x + 1)(x^3 + x^2 + 1).$$

Thus, a simple algorithm to find a factor of smallest degree of P(x) is to compute 
$\mathrm{GCD}(x^{2^d} + x, P(x))$ for d = 1, 2, ... The first time that the GCD is nontrivial, it contains a factor of 
minimal degree d. If the GCD has degree > d, it must be a product of factors of degree d. If 
no factor has been found for $d \le r/2$, where r = deg(P(x)), then P(x) must be irreducible. 
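The simple search just described can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the implementation discussed in this paper: polynomials over GF(2) are stored as Python integers (bit i is the coefficient of x^i), the helper names `pmod`, `pmul`, `pgcd` and `smallest_factor_product` are hypothetical, and the arithmetic is schoolbook, so it is only practical for small degrees.

```python
def pmod(a, p):
    """Remainder of a modulo p in GF(2)[x] (ints used as coefficient bit vectors)."""
    dp = p.bit_length()
    while a.bit_length() >= dp:
        a ^= p << (a.bit_length() - dp)
    return a


def pmul(a, b):
    """Carry-less (GF(2)[x]) product of a and b."""
    out = 0
    while b:
        if b & 1:
            out ^= a
        a <<= 1
        b >>= 1
    return out


def pgcd(a, b):
    """Euclidean GCD in GF(2)[x]."""
    while b:
        a, b = b, pmod(a, b)
    return a


def smallest_factor_product(p):
    """Return (d, g) for the least d such that g = GCD(x^(2^d) + x, P) is nontrivial,
    or None if P has no factor of degree <= deg(P)/2 (so P is irreducible)."""
    r = p.bit_length() - 1
    h = 0b10                          # h = x
    for d in range(1, r // 2 + 1):
        h = pmod(pmul(h, h), p)       # one more squaring: h = x^(2^d) mod P
        g = pgcd(p, h ^ 0b10)         # GCD(x^(2^d) + x, P)
        if g != 1:
            return d, g
    return None
```

For instance, `smallest_factor_product(0b100011)` examines $x^5 + x + 1 = (x^2 + x + 1)(x^3 + x^2 + 1)$ and returns `(2, 0b111)`, while $x^7 + x + 1$ (encoded as `(1 << 7) | 0b11`) yields `None`.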

Some simplifications are possible when $P(x) = x^r + x^s + 1$ is a trinomial over GF(2) with 
r or s odd (otherwise P(x) is trivially reducible): 

(1) We can skip the case d = 1 because a trinomial cannot have a factor of degree 1. 

(2) Since $x^r P(1/x) = x^r + x^{r-s} + 1$, we only need consider $s \le r/2$. 

(3) We can assume that P(x) is square-free. 

(4) By applying Swan's theorem, we can often show that the trinomial under 
consideration has an odd number of irreducible factors; in this case we only need check 
$d \le r/3$ before claiming that P(x) is irreducible. 

2. Complexity of the algorithm 

Note that $x^{2^d}$ should not be computed explicitly; it is much better to compute $x^{2^d} \bmod P(x)$ 
by repeated squaring. The complexity of squaring modulo a trinomial of degree r is 
only S(r) = O(r) bit-operations. 
2.1. Complexity of polynomial multiplication and squaring. As well as performing 
GCD computations we need to perform multiplications in GF(2)[x]/P(x), and an important 
special case is squaring a polynomial modulo P(x), so we first consider the bit-complexity 
of these operations. 

Multiplication of polynomials of degree r over GF(2) can be performed in time 
$M(r) = O(r \log r \log\log r)$. We have implemented an algorithm of Schönhage [16] that achieves 
this bound. The algorithm uses a radix-3 FFT and is different from the better-known 
Schönhage-Strassen algorithm [17]. We remark that the log log r term in the time-bound 
for the Schönhage-Strassen algorithm has been reduced by Fürer [9], but it is not clear if a 
similar idea can be used to improve Schönhage's algorithm [16]. In any event the log log r 
term comes from the number of levels of recursion and is a small constant for the values of r 
that we are considering. 

In practice, Schönhage's algorithm is not the fastest unless r is quite large. We have 
also implemented classical, Karatsuba and Toom-Cook algorithms that have $M(r) = O(r^\alpha)$, 
$1 < \alpha \le 2$, since these algorithms are easier to implement and are faster for small r. Our 
implementations of the Toom-Cook algorithms TC3 and TC4 are based on recent ideas of 
Bodrato [1]. 

For brevity we assume that r is large and Schönhage's algorithm is used. On a 64-bit 
machine the crossover versus TC4 occurs near degree r = 108000. 

In the complexity estimates we assume that M(r) is a sufficiently smooth and well- 
behaved function. 

By squaring we mean squaring a polynomial of degree < r and reduction mod P(x). 
Squaring in GF(2)[x]/P(x) can be performed in time $S(r) = O(r) \ll M(r)$ (assuming, as 
usual, that P(x) is a trinomial). Our algorithm takes advantage of the fact that squaring is 
much faster than multiplication. 

Where possible we use the memory-efficient squaring algorithm of Brent, Larvala and 
Zimmermann [4], which in our implementation is about 2.2 times faster than the naive 
squaring algorithm. 
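To make the linear cost of squaring concrete, here is a small Python sketch of the naive approach: squaring over GF(2) just moves the coefficient of x^i to x^(2i), and the result is folded back using x^r = x^s + 1. It is not the memory-efficient algorithm of [4]; the function names are hypothetical, polynomials are Python integers (bit i is the coefficient of x^i), and the code aims to show the structure rather than to be fast.

```python
def spread_bits(a):
    """Square a in GF(2)[x]: the coefficient of x^i moves to x^(2i)."""
    out, i = 0, 0
    while a:
        if a & 1:
            out |= 1 << (2 * i)
        a >>= 1
        i += 1
    return out


def reduce_trinomial(a, r, s):
    """Reduce a modulo x^r + x^s + 1 by folding the high part with x^r = x^s + 1."""
    mask = (1 << r) - 1
    while a >> r:
        hi = a >> r                       # a = lo + x^r * hi
        a = (a & mask) ^ hi ^ (hi << s)   # lo + (x^s + 1) * hi
    return a


def square_mod_trinomial(a, r, s):
    """Square a (degree < r) modulo the trinomial x^r + x^s + 1."""
    return reduce_trinomial(spread_bits(a), r, s)
```

As a check, modulo $x^7 + x + 1$ we have $(x^4)^2 = x^8 = x \cdot x^7 = x^2 + x$, so `square_mod_trinomial(0b10000, 7, 1)` gives `0b110`. No multiplication is involved, only a bit spread and one or two folds.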

2.2. Complexity of GCD. For GCDs we use a sub-quadratic algorithm that runs in time 
$G(r) = \Theta(M(r) \log r)$. More precisely, 

$$G(2r) = 2G(r) + \Theta(M(r)),$$

so for $\alpha > 1$, 

$$M(r) = O(r^\alpha) \implies G(r) = \Theta(M(r)),$$

and 

$$M(r) = \Theta(r \log r \log\log r) \implies G(r) = \Theta(M(r) \log r).$$

In practice, for $r \approx 2.4 \times 10^7$ and our implementation on a 2.2 GHz Opteron, $S(r) \approx 0.005$ 
second, $M(r) \approx 2$ seconds, $G(r) \approx 80$ seconds, so $M(r)/S(r) \approx 400$, and $G(r)/M(r) \approx 40$. 

2.3. Avoiding GCD computations. In the context of integer factorization, Pollard [15] 
suggested a blocking strategy to avoid most GCD computations and thus reduce the 
amortized cost; von zur Gathen and Shoup [12] applied the same idea to polynomial factorization. 
The idea of blocking is to choose a parameter $\ell > 0$ and, instead of computing 

$$\mathrm{GCD}(x^{2^d} + x, P(x)) \quad \text{for } d \in [d', d' + \ell),$$

compute 

$$\mathrm{GCD}(p_\ell(x^{2^{d'}}, x), P(x)),$$

where the interval polynomial $p_\ell(X, x)$ is defined by 

$$p_\ell(X, x) = \prod_{j=0}^{\ell-1} \left(X^{2^j} + x\right).$$

In this way we replace $\ell$ GCDs by one GCD and $\ell - 1$ multiplications mod P(x). 

The drawback of blocking is that we may have to backtrack if P(x) has more than one 
factor with degree in the interval $[d', d' + \ell)$, since the algorithm produces the product of 
these factors. Thus $\ell$ should not be too large. The optimal strategy depends on the expected 
size distribution of factors and the ratio of times for GCDs and multiplications. 
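Single-level blocking, and the backtracking issue it raises, can be illustrated with the following Python sketch. It is a toy model under our own assumptions (polynomials as Python integers, schoolbook arithmetic, hypothetical function names) and is not the implementation discussed in this paper.

```python
def pmod(a, p):
    """Remainder of a modulo p in GF(2)[x] (ints used as coefficient bit vectors)."""
    dp = p.bit_length()
    while a.bit_length() >= dp:
        a ^= p << (a.bit_length() - dp)
    return a


def pmul(a, b):
    """Carry-less (GF(2)[x]) product of a and b."""
    out = 0
    while b:
        if b & 1:
            out ^= a
        a <<= 1
        b >>= 1
    return out


def pgcd(a, b):
    """Euclidean GCD in GF(2)[x]."""
    while b:
        a, b = b, pmod(a, b)
    return a


def blocked_gcd(p, d_start, ell):
    """One GCD for the whole interval [d_start, d_start + ell):
    GCD(P, prod_d (x^(2^d) + x)), accumulating the product modulo P."""
    h = 0b10
    for _ in range(d_start):
        h = pmod(pmul(h, h), p)              # h = x^(2^d_start) mod P
    acc = 1
    for _ in range(ell):
        acc = pmod(pmul(acc, h ^ 0b10), p)   # multiply in x^(2^d) + x
        h = pmod(pmul(h, h), p)              # advance d by one squaring
    return pgcd(p, acc)
```

With $P = x^5 + x + 1 = (x^2 + x + 1)(x^3 + x^2 + 1)$, the call `blocked_gcd(0b100011, 1, 2)` returns the degree-2 factor `0b111`, whereas `blocked_gcd(0b100011, 1, 3)` returns all of P, because both factors fall in the interval. The second case is exactly the situation in which one has to backtrack with smaller blocks.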

2.4. Multi-level blocking. Our (apparently new) idea is to use a finer level of blocking 
to replace most multiplications by squarings, which speeds up the computation in 
GF(2)[x]/P(x) of the above interval polynomials. The idea is to split the interval $[d', d' + \ell)$ 
into $k \ge 2$ smaller intervals of length m over which 

$$p_m(X, x) = \prod_{j=0}^{m-1} \left(X^{2^j} + x\right) = \sum_{j=0}^{m} x^{m-j} s_{j,m}(X), \tag{1}$$

where 

$$s_{j,m}(X) = \sum_{0 \le k < 2^m,\; w(k) = j} X^k, \tag{2}$$

and w(k) denotes the Hamming weight of k, that is the number of nonzero bits in the binary 
representation of k. 

For example, for m = 3, we have: 

$$p_3(X, x) = x^3 + x^2 (X^4 + X^2 + X) + x (X^6 + X^5 + X^3) + X^7,$$

where $s_{0,3}(X) = 1$, $s_{1,3}(X) = X^4 + X^2 + X$, $s_{2,3}(X) = X^6 + X^5 + X^3$, and $s_{3,3}(X) = X^7$. 
Note that 

$$s_{j,m}(X^2) = s_{j,m}(X)^2 \quad \text{in } \mathrm{GF}(2)[x]/P(x).$$

Thus, $p_m(x^{2^d}, x)$ can be computed with cost $m^2 S(r)$ if we already know $s_{j,m}(x^{2^{d-m}})$ for 
$0 < j \le m$. (The constant polynomial $s_{0,m}(X) = 1$ is computed only once.) 

Continuing the example with m = 3, and assuming that we know $s_{1,3}(x^{2^{d-3}})$, $s_{2,3}(x^{2^{d-3}})$, 
and $s_{3,3}(x^{2^{d-3}})$, squaring each of these m = 3 times gives $s_{1,3}(x^{2^d})$, $s_{2,3}(x^{2^d})$, and $s_{3,3}(x^{2^d})$, 
from which we can easily get $p_3(x^{2^d}, x)$ using the sum in Eq. (1). 

In this way we replace m - 1 multiplications and m squarings (the cost if we used the product 
in Eq. (1)) by $m^2$ squarings. Each $s_{j,m}$, $0 < j \le m$, requires m squarings to be shifted 
from argument $x^{2^{d-m}}$ to argument $x^{2^d}$. The summation in Eq. (1) costs only O(mr), which 
is negligible. Choosing $m \approx \sqrt{M(r)/S(r)}$ (about 20 if $M(r)/S(r) \approx 400$), the speedup over 
single-level blocking is about $m/2 \approx 10$ (not counting the cost of GCDs). 

Von zur Gathen and Gerhard [11, p. 1685] suggested using the same idea with m = 2 (thus 
reducing the number of multiplications by a factor of two), but did not consider choosing 
an optimal m > 2. 
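The following Python sketch checks the inner-level idea numerically on a small example: the sum in Eq. (1) agrees with the product, and squaring each $s_{j,m}$ exactly m times moves the whole block from argument X to argument $X^{2^m}$. Everything here is illustrative and assumed for the sketch: polynomials are Python integers (bit i is the coefficient of x^i), all function names are hypothetical, and $s_{j,m}$ is evaluated directly from definition (2), which is only sensible for tiny m.

```python
def pmod(a, p):
    """Remainder of a modulo p in GF(2)[x] (ints used as coefficient bit vectors)."""
    dp = p.bit_length()
    while a.bit_length() >= dp:
        a ^= p << (a.bit_length() - dp)
    return a


def pmul(a, b):
    """Carry-less (GF(2)[x]) product of a and b."""
    out = 0
    while b:
        if b & 1:
            out ^= a
        a <<= 1
        b >>= 1
    return out


def psqr_times(a, n, p):
    """Square a modulo p, n times in succession."""
    for _ in range(n):
        a = pmod(pmul(a, a), p)
    return a


def s_direct(j, m, X, p):
    """s_{j,m}(X) mod p from definition (2): sum of X^k over k < 2^m of weight j."""
    total, power = 0, 1                       # power = X^k mod p
    for k in range(1 << m):
        if bin(k).count("1") == j:
            total ^= power
        power = pmod(pmul(power, X), p)
    return total


def interval_product(X, m, p):
    """p_m(X, x) mod p computed as the product in Eq. (1): m - 1 real multiplications."""
    acc = 1
    for _ in range(m):
        acc = pmod(pmul(acc, X ^ 0b10), p)
        X = pmod(pmul(X, X), p)
    return acc


def interval_sum(s, m, p):
    """p_m(X, x) mod p computed as the sum in Eq. (1), given s[j] = s_{j,m}(X)."""
    acc = 0
    for j in range(m + 1):
        acc ^= pmod(s[j] << (m - j), p)       # x^(m-j) * s_{j,m}(X)
    return acc


def next_block(s, m, p):
    """Shift every s_{j,m} from argument X to X^(2^m), using m squarings each."""
    return [psqr_times(sj, m, p) for sj in s]
```

A typical use is to initialise the list `s` once, then alternate `interval_sum` (to obtain the interval polynomial) with `next_block` (to advance by m degrees); only the accumulation of successive interval polynomials into the outer product still needs a genuine multiplication.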

At first sight initialization of the polynomials $s_{j,m}(X)$ for X = x might appear to be 
expensive, since the definition (2) involves $O(2^m)$ terms. However, the polynomials $s_{j,m}(X)$ 
satisfy a "Pascal triangle" recurrence relation 

$$s_{j,m}(X) = s_{j,m-1}(X) + X^{2^{m-1}} s_{j-1,m-1}(X)$$

with boundary conditions 

$$s_{j,m}(X) = \begin{cases} 0 & \text{if } j > m \ge 0, \\ 1 & \text{if } m \ge j = 0. \end{cases}$$

Using this recurrence, it is easy to compute $s_{j,m}(x) \bmod P(x)$ for $0 \le j \le m$ in time $O(m^2 r)$. 
Thus, the initialization is cheap. 
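A short Python sketch of this initialization follows. It assumes the recurrence has the form $s_{j,m}(X) = s_{j,m-1}(X) + X^{2^{m-1}} s_{j-1,m-1}(X)$, which follows from definition (2) by splitting the exponents k according to their top bit. The function names are hypothetical, polynomials are Python integers (bit i is the coefficient of x^i), and a generic multiplication is used where a real implementation with X = x would use shifts.

```python
def pmod(a, p):
    """Remainder of a modulo p in GF(2)[x] (ints used as coefficient bit vectors)."""
    dp = p.bit_length()
    while a.bit_length() >= dp:
        a ^= p << (a.bit_length() - dp)
    return a


def pmul(a, b):
    """Carry-less (GF(2)[x]) product of a and b."""
    out = 0
    while b:
        if b & 1:
            out ^= a
        a <<= 1
        b >>= 1
    return out


def s_table(m, X, p):
    """Return [s_{0,m}(X), ..., s_{m,m}(X)] modulo p, built row by row
    with the Pascal-triangle recurrence (row i holds s_{j,i} for j = 0..i)."""
    s = [1] + [0] * m              # row 0: s_{0,0} = 1 and s_{j,0} = 0 for j > 0
    Y = X                          # Y = X^(2^(i-1)) while building row i
    for i in range(1, m + 1):
        new = s[:]
        for j in range(1, i + 1):
            new[j] = s[j] ^ pmod(pmul(Y, s[j - 1]), p)
        s = new
        Y = pmod(pmul(Y, Y), p)
    return s
```

For m = 3 and X = x (with a modulus of degree larger than 7, so that no reduction occurs) the table is `[1, 0b10110, 0b1101000, 0b10000000]`, i.e. $1$, $x^4 + x^2 + x$, $x^6 + x^5 + x^3$ and $x^7$, matching the m = 3 example given earlier in this section. Only about $m^2/2$ additions and shifted products are needed, instead of the $2^m$ terms of the definition.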

To summarise, we use two levels of blocking: 

(1) The outer level replaces most GCDs by multiplications. 

(2) The inner level replaces most multiplications by squarings. 

(3) The parameter $m \approx \sqrt{M(r)/S(r)}$ is used for the inner level of blocking. 

(4) A different parameter $\ell = km$ is used for the outer level of blocking. 

For example, suppose S = 1/400, M = 1, G = 40 (where we have normalised so M = 1). 
We could choose $\ell = 80$ and m = 20. With no blocking, the cost for an interval of length 
80 is 80G + 80S = 3200.2; with 1-level blocking the cost is G + 79M + 80S = 119.2; with 
2-level blocking the cost is G + 3M + 1600S = 47.0. 
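The arithmetic of this example can be reproduced with a tiny cost model. The function below is only a sketch of the counting used in the example (one GCD per outer block, $m^2$ squarings per inner block, $k - 1$ multiplications to combine the k inner blocks); its name and signature are our own.

```python
def block_costs(ell, m, S, M, G):
    """Cost of covering an interval of length ell = k * m under three strategies.

    S, M, G are the costs of one squaring, one multiplication and one GCD."""
    k = ell // m
    no_blocking = ell * (G + S)                     # one squaring and one GCD per degree
    one_level = G + (ell - 1) * M + ell * S         # one GCD, ell - 1 multiplications
    two_level = G + (k - 1) * M + k * m * m * S     # m^2 squarings per inner block
    return no_blocking, one_level, two_level


costs = block_costs(ell=80, m=20, S=1 / 400, M=1.0, G=40.0)
print([round(c, 1) for c in costs])
```

With these inputs the three costs come out as 3200.2, 119.2 and 47.0 (up to floating-point rounding). Varying m in this model shows the trade-off: larger m reduces the number of multiplications but the $m^2$ squarings per inner block eventually dominate.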

2.5. Sieving out small factors. We define a small factor to be one with degree 
$d < \frac{1}{2} \log_2 r$, so $2^d < \sqrt{r}$. The constant $\frac{1}{2}$ in the definition is arbitrary and could be replaced by 
any fixed constant in (0, 1). A large factor is a factor that is not small. 

It would be inefficient to find small factors in the same way as large factors. Instead, let 
$D = 2^d - 1$, $r' = r \bmod D$, $s' = s \bmod D$. Then 

$$P(x) = x^r + x^s + 1 \equiv x^{r'} + x^{s'} + 1 \pmod{x^D - 1},$$

so we only need compute 

$$\mathrm{GCD}(x^{r'} + x^{s'} + 1, \; x^D - 1).$$

Because $r', s' < D < \sqrt{r}$, the cost of finding small factors is negligible (both theoretically 
and in practice), so can be neglected. 
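The sieving step is easy to sketch in Python because only the exponents r and s are needed, never the full polynomial. The code below is illustrative only: the name `small_factor_gcd` is hypothetical, polynomials are Python integers (bit i is the coefficient of x^i), and a plain Euclidean GCD is used.

```python
def pmod(a, p):
    """Remainder of a modulo p in GF(2)[x] (ints used as coefficient bit vectors)."""
    dp = p.bit_length()
    while a.bit_length() >= dp:
        a ^= p << (a.bit_length() - dp)
    return a


def pgcd(a, b):
    """Euclidean GCD in GF(2)[x]."""
    while b:
        a, b = b, pmod(a, b)
    return a


def small_factor_gcd(r, s, d):
    """GCD(x^r' + x^s' + 1, x^D - 1) with D = 2^d - 1, r' = r mod D, s' = s mod D.

    The result is nontrivial exactly when x^r + x^s + 1 has an irreducible
    factor whose degree divides d."""
    D = (1 << d) - 1
    # XOR (rather than OR) so that coinciding exponents cancel, as they do over GF(2).
    reduced = (1 << (r % D)) ^ (1 << (s % D)) ^ 1
    return pgcd((1 << D) | 1, reduced)
```

For example, `small_factor_gcd(5, 1, 2)` returns `0b111`, detecting the factor $x^2 + x + 1$ of $x^5 + x + 1$, while `small_factor_gcd(7, 1, 2)` and `small_factor_gcd(7, 1, 3)` both return 1. The operands have degree at most D, so the work is independent of r apart from the two modular reductions of the exponents.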

2.6. Outer level blocking strategy. The blocksize in the outer level of blocking is $\ell = km$. 
We take a linearly increasing sequence of block sizes 

$$k = k_0 j \quad \text{for } j = 1, 2, 3, \ldots,$$

where the first interval starts at about log r (since small factors will have been found by 
sieving). 

The choice $k = k_0 j$ leads to a quadratic polynomial for the interval bounds; other 
possibilities are discussed by von zur Gathen and Gerhard [11]. 

In principle, using the data that we have obtained on the distribution of degrees of smallest 
factors of trinomials (see Section 3) and assuming that this distribution is not very sensitive to 
the degree r, we could obtain a strategy that is close to optimal. However, the choice $k_0 j$ 
with suitable $k_0$ is easy to implement and not too far from optimal. The number of GCD 
and sqr/mul operations is usually within a factor of 1.5 of the minimum possible in our 
experiments. 
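The resulting schedule of outer blocks can be generated as follows. This is a sketch under stated assumptions: the function name and its parameters are hypothetical, and clipping the final block at the search limit is our own choice rather than something specified in the text.

```python
def outer_blocks(start, k0, m, limit):
    """Half-open degree intervals [lo, hi) of length k0 * j * m for j = 1, 2, 3, ...

    start: first degree to test (small factors are assumed already sieved out)
    limit: largest degree that must be covered, e.g. r // 2 or r // 3"""
    blocks = []
    lo, j = start, 1
    while lo <= limit:
        hi = min(lo + k0 * j * m, limit + 1)   # clip the last block (our assumption)
        blocks.append((lo, hi))
        lo, j = hi, j + 1
    return blocks


print(outer_blocks(start=10, k0=1, m=2, limit=30))
```

The j-th block ends near $\text{start} + k_0 m\, j(j+1)/2$, which is the quadratic growth of the interval bounds mentioned above; a factor of degree d is therefore reached after about $\sqrt{2d/(k_0 m)}$ GCDs.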



3. Distribution of degrees of factors 

In order to predict the expected behaviour of our algorithm, we need to know the expected 
distribution of degrees of smallest irreducible factors. From Swan's theorem [21], we know 
that there are significant differences between the distribution of factors of trinomials and 
of all polynomials of the same degree. Our complexity estimates are based on the heuristic 
assumption that this difference is not too large, in a sense made precise by Hypothesis 3.1. 

Hypothesis 3.1. Over all trinomials $x^r + x^s + 1$ of degree r over GF(2), the probability $\pi_d$ 
that a trinomial has no nontrivial factor of degree $\le d$, $1 < d < r$, is at most c/d, where c 
is a constant. 

Hypothesis 3.1 implies that there are at most c irreducible trinomials of degree r. This 
is probably false, as there may well be a sequence of exceptional r for which the number 
of irreducible trinomials is unbounded. Thus, we may need to replace the constant c in 
Hypothesis 3.1 by a slowly-growing function c(r). Nevertheless, in order to give realistic 
complexity estimates that are in agreement with experiments, we assume below that 
Hypothesis 3.1 is correct. Under this assumption we use an amortized model to obtain the 
total complexity over all trinomials of degree r. 

From Hypothesis 3.1, the probability that a trinomial does not have a small factor (as 
defined in §2.5) is O(1/log r). 

Table 1 gives the observed values of $d\pi_d$ for r = 3021377, r = 6972593, and r = 24036583. 
The maximum values for each r are given in bold. The table shows that the values of $d\pi_d$ 
are remarkably stable for small d, and bounded by 4 for large d (this is because there are 
four irreducible trinomials of degree 3021377 and also four of degree 24036583, when we 
count both trinomials $x^r + x^s + 1$ and their reciprocals $x^r + x^{r-s} + 1$). 

3.1. Consequences of the hypothesis. Define $p_d = \pi_{d-1} - \pi_d$ to be the probability that 
the smallest nontrivial factor f of a randomly chosen trinomial has degree d = deg(f). In 
order to estimate the running time of our algorithm, we use the following Lemma, which 
gives the expectation $E(d^\beta)$ of $d^\beta$. 

Lemma 3.2. If $\beta > 0$ is constant and Hypothesis 3.1 holds, then 

$$E(d^\beta) = \begin{cases} O(1) & \text{if } \beta < 1, \\ O(\log r) & \text{if } \beta = 1, \\ O(r^{\beta-1}) & \text{if } \beta > 1. \end{cases}$$
Table 1. $d\pi_d$ for various degrees r. 

| d        | r = 3021377 | r = 6972593 | r = 24036583 |
|----------|-------------|-------------|--------------|
| 2        | 1.333       | 1.333       | 1.333        |
| 3        | 1.429       | 1.429       | 1.429        |
| 4        | 1.524       | 1.524       | 1.524        |
| 5        | 1.536       | 1.536       | 1.536        |
| 6        | 1.598       | 1.598       | 1.598        |
| 7        | 1.600       | 1.600       | 1.600        |
| 8        | 1.667       | 1.667       | 1.667        |
| 9        | 1.642       | 1.642       | 1.642        |
| 10       | 1.652       | 1.652       | 1.652        |
| 100      | 1.763       | 1.771       | 1.770        |
| 1000     | 1.783       | 1.756       | 1.786        |
| 10000    | 1.946       | 1.873       | 1.786        |
| 100000   | 1.986       | 1.606       | 1.880        |
| 279383   | 1.480       | **2.084**   | 1.813        |
| 1000000  | 1.324       | 1.147       | 1.831        |
| 10000000 |             |             | 1.664        |
| r-1      | **4.000**   | 2.000       | **4.000**    |


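Proof. We use summation by parts. Note that a trinomial has no factor of degree 1, so 
$p_1 = 0$ and $\pi_0 = \pi_1 = 1$. Thus 

$$\begin{aligned}
E(d^\beta) &= \sum_{d=1}^{r-1} d^\beta p_d = \sum_{d=1}^{r-1} d^\beta (\pi_{d-1} - \pi_d) \\
&\le 1 + \sum_{d=1}^{r-1} \left((d+1)^\beta - d^\beta\right) \pi_d \\
&\le 1 + c \sum_{d=1}^{r-1} \frac{(d+1)^\beta - d^\beta}{d} \quad \text{(by Hypothesis 3.1)} \\
&= 1 + O\left(\sum_{d=1}^{r-1} d^{\beta-2}\right),
\end{aligned}$$

and the result follows. □ 

The following Lemma gives a stronger result in the case $\beta < 1$. 

Lemma 3.3. If $0 < \beta < 1$, $0 < D < r$, and Hypothesis 3.1 holds, then 

$$\sum_{d=D}^{r-1} d^\beta p_d = O(D^{\beta-1}).$$
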
Proof. The proof is similar to that of Lemma 3.2. We end with the upper bound 

$$\sum_{d=D}^{r-1} \left((d+1)^\beta - d^\beta\right) \pi_d + D^\beta \pi_{D-1}.$$

From Hypothesis 3.1, $\pi_{D-1} = O(1/D)$, and the sum over d is $O(D^{\beta-1})$, so the result 
follows. □ 

4. Expected cost of sqr/mul and GCD 

Recall that the inner level of blocking replaces m multiplications by $m^2$ squarings and 
one multiplication, where the choice $m \approx \sqrt{M(r)/S(r)}$ makes the total cost of squarings 
about equal to the cost of multiplications. 

For a smallest factor of degree d, the number of squarings is $m(d + O(\sqrt{d}))$, where the 
$O(\sqrt{d})$ term follows from our choice of outer-level blocksizes (see §2.6). Averaging over all 
trinomials of degree r, the expected number of squarings is 

$$O\left(m \sum_{d \le r/2} \left(d + O(\sqrt{d})\right) p_d\right),$$

and from Lemma 3.2 this is O(m log r). Thus, the expected cost of sqr/mul operations per 
trinomial is 

$$O\left(S(r) \log r \sqrt{M(r)/S(r)}\right) = O\left(\log r \sqrt{M(r) S(r)}\right) = O\left(r (\log r)^{3/2} (\log\log r)^{1/2}\right). \tag{3}$$

If we used only a single level of blocking, then the cost of multiplications would dominate that 
of squarings, with an expected cost per trinomial of $O(M(r) \log r) = O(r (\log r)^2 \log\log r)$. 
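For readers who want the intermediate step behind Eq. (3), the following worked derivation substitutes S(r) = O(r) and $M(r) = O(r \log r \log\log r)$ from §2.1. It only restates quantities already defined in the text; the last line compares the single-level and two-level bounds up to constant factors.

```latex
\begin{align*}
\sqrt{M(r)\,S(r)}
  &= O\!\left(\sqrt{r \log r \log\log r \cdot r}\right)
   = O\!\left(r\,(\log r)^{1/2} (\log\log r)^{1/2}\right), \\
\log r \cdot \sqrt{M(r)\,S(r)}
  &= O\!\left(r\,(\log r)^{3/2} (\log\log r)^{1/2}\right), \\
\frac{M(r) \log r}{\sqrt{M(r)\,S(r)}\,\log r}
  &= \sqrt{M(r)/S(r)} \approx m .
\end{align*}
```

The ratio in the last line is the order of the gain of two-level over single-level blocking for the sqr/mul part; the more careful count in §2.4 gives a speedup of about m/2.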

Eq. (3) is correct as $r \to \infty$. However, in practice, at least for $r < 6.4 \times 10^7$, our 
implementation of Schönhage's FFT-based polynomial multiplication algorithm [16] calls a different 
multiplication routine (usually TC4) to perform smaller multiplications, rather than 
recursively calling itself. TC4 has exponent $\alpha' = \ln(7)/\ln(4) \approx 1.4$, so the effective exponent for 
FFT multiplication is $\alpha = (1 + \alpha')/2 \approx 1.2 > 1$. In this case, the expected cost of sqr/mul 
operations per trinomial is 

$$O\left(\log r \sqrt{M(r) S(r)}\right) = O\left(r^{(1+\alpha)/2} \log r\right) = O\left(r^{1.1} \log r\right). \tag{4}$$

4.1. Expected cost of GCDs. Suppose that P(x) has a smallest factor of degree d. The 
number of GCDs required to find the factor, using our (quadratic polynomial) blocking 
strategy, is at least 1, and $O(\sqrt{d})$ if d is large. By Hypothesis 3.1, the expected number of 
GCDs for a trinomial with no small factor is 

$$1 + O\left(\sum_{\log_2 r < 2d \le r} d^{1/2} p_d\right),$$

and by Lemma 3.3 this is 

$$1 + O\left(\frac{1}{\sqrt{\log r}}\right).$$

Thus the expected cost of GCDs per trinomial is 

$$O(G(r)/\log r) = O(M(r)) = O(r \log r \log\log r). \tag{5}$$

Eq. (5) is asymptotically less than the expected cost (3) of sqr/mul operations. However, if 
$M(r) = O(r^\alpha)$ with $\alpha > 1$, then the expected cost of GCDs is $O(r^\alpha/\log r)$, which is 
asymptotically greater than the expected cost (4) of sqr/mul operations. Note the expected 
cost of GCDs does not depend on whether we use one or two levels of blocking. 

For $r \approx 2.4 \times 10^7$, GCDs take about 65% of the time versus 35% for sqr/mul. 

4.2. Comparison with previous algorithms. For simplicity we use the $\tilde{O}$ notation which 
ignores log factors. For example, $M(r) = \tilde{O}(r)$. 

The "naive" algorithm, as implemented by Brent, Larvala and Zimmermann [4] and 
earlier authors, takes an expected time $\tilde{O}(r^2)$ per trinomial, or $\tilde{O}(r^3)$ to cover all trinomials 
of degree r. 

The single-level blocking strategy and the new algorithm both take expected time $\tilde{O}(r)$ 
per trinomial, or $\tilde{O}(r^2)$ to cover all trinomials of degree r. 

In practice, the new algorithm is faster than the naive algorithm by a factor of about 160 
for r = 6972593, and by a factor of about 560 for r = 24036583. For r = 24036583, sqr/mul 
operations take 35% of the total time in the new algorithm and the speedup on these 
operations is about 10; this gives a global speedup of more than 4 over the single-level 
blocking strategy. 
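The global speedup over single-level blocking can be checked with a one-line estimate. Normalise the running time of the new algorithm to 1, split as 0.65 for GCDs and 0.35 for sqr/mul (the figures quoted in §4.1), and recall that the GCD cost is the same with one or two levels of blocking while the sqr/mul part is about 10 times slower with a single level. The symbols $T_{\text{new}}$ and $T_{\text{single}}$ are introduced here only for this illustration.

```latex
\begin{align*}
T_{\text{new}}    &= \underbrace{0.65}_{\text{GCD}} + \underbrace{0.35}_{\text{sqr/mul}} = 1, \\
T_{\text{single}} &\approx 0.65 + 10 \times 0.35 = 4.15, \\
T_{\text{single}} / T_{\text{new}} &\approx 4.15 > 4 .
\end{align*}
```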

4.3. Some details of our implementation. We first implemented the 
2-level blocking strategy in NTL [18]. To get full efficiency, we rewrote all critical routines 
and tuned them for the target processors. Our squaring routine implements the 
algorithm described in [4], which is more than twice as fast as the corresponding optimized 
NTL routine for trinomials. Our multiplication routine implements Toom-Cook 3-way, 
4-way, and Schönhage's algorithm [16]. We also improved the basecase multiplication code; 
more details concerning efficient multiplication in GF(2)[x] will be published in [5]. Finally, 
we implemented a subquadratic GCD routine, since NTL only provides a classical GCD for 
binary polynomials. 

4.4. Primitive trinomials. The largest published primitive trinomial is 

$$x^{6972593} + x^{3037958} + 1,$$

found by Brent, Larvala and Zimmermann [?] in 2002 using a naive (but efficiently 
implemented) algorithm. 

In March-April 2007, we tested our new program by verifying the published results on 
primitive trinomials for Mersenne exponents $r \le 6972593$, and in the process produced 
certificates of reducibility (lists of smallest factors for each reducible trinomial). These are 
available from the first author's website [3]. 

In April-August 2007, we ran our new algorithm to search for primitive trinomials of 
degree r = 24036583. This is the next Mersenne exponent, apart from two that are trivial 
to exclude by Swan's theorem. It would take about 41 times as long as for r = 6972593 
by the naive algorithm, but our new program is 560 times faster than the naive algorithm. 
Each trinomial takes on average about 16 seconds on a 2.2 GHz Opteron. 

The complete computation was performed in four months, using about 24 Opteron and 
Core 2 processors located at ANU and INRIA. 

We found two new primitive trinomials of (equal) record degree: 

$$x^{24036583} + x^{8412642} + 1 \tag{6}$$

and 

$$x^{24036583} + x^{8785528} + 1. \tag{7}$$

4.5. Verification. Allan Steel [TH] kindly verified irreducibility of (6)-(7) using Magma [2]. 
Each verification took about 67 hours on a 2.4 GHz Core 2 processor. Independent 
verifications using our irred V3.15 program [4, 6] took about 35 hours on a 2.2 GHz Opteron. 
The difference in speed is mainly due to the fast squaring algorithm implemented in irred. 

Primitivity of (6)-(7) follows from irreducibility provided that the degree 24036583 is a 
Mersenne exponent. We have not verified this, but rely on computations performed by the 
GIMPS project [22]. 

Reducibility of the remaining trinomials of degree 24036583 can be verified using the 
certificate (or extended log, a list of smallest irreducible factors) available from our website [3]. 
The verification takes less than 10 hours using Magma on a 2.66 GHz Core 2 processor. 

5. Conclusion 

The new double-blocking strategy, combined with fast multiplication and GCD 
algorithms, has allowed us to find new primitive trinomials of record degree. 

The same ideas should work over finite fields GF(p) for small prime p > 2, and for 
factoring sparse polynomials P(x) that are not necessarily trinomials: all we need is that 
the time for p-th powers (mod P(x)) is much less than the time for multiplication (mod 
P(x)). 